Abstract

Background: Y-Short Tandem Repeats (Y-STR) data consist of many similar and almost similar objects. This 
characteristic of Y-STR data causes two problems with partitioning: non-unique centroids and local minima 
problems. As a result, the existing partitioning algorithms produce poor clustering results. 

Results: Our new algorithm, called k-Approximate Modal Haplotypes (k-AMH), obtains the highest clustering
accuracy scores for five out of six datasets, and performs equally well on the remaining dataset.
Furthermore, clustering accuracy scores of 100% are achieved for two of the datasets. The k-AMH algorithm records
the highest mean accuracy score of 0.93 overall, compared to that of other algorithms: k-Population (0.91),
k-Modes-RVF (0.81), New Fuzzy k-Modes (0.80), k-Modes (0.76), k-Modes-Hybrid 1 (0.76), k-Modes-Hybrid 2 (0.75),
Fuzzy k-Modes (0.74), and k-Modes-UAVM (0.70).

Conclusions: The partitioning performance of the k-AMH algorithm for Y-STR data is superior to that of other
algorithms, owing to its ability to solve the non-unique centroids and local minima problems. Our algorithm is also
efficient in terms of time complexity, which is recorded as O(km(n-k)) and considered to be linear.
Background

Y-Short Tandem Repeats (Y-STR) data represent the number
of times an STR motif repeats on the Y-chromosome.
This count is often called the allele value of a marker. For
example, if the allele value of the DYS391 marker is eight,
the STR consists of eight consecutive copies of the motif:
[TCTA] [TCTA] [TCTA] [TCTA] [TCTA]
[TCTA] [TCTA] [TCTA]. The number of tandem
repeats has been used effectively to characterize and
differentiate between two people.
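As a minimal sketch of the idea that an allele value is a repeat count, the following Python function counts the longest uninterrupted run of a motif in a sequence string. The function name and the flanking bases in the usage line are hypothetical, introduced only for illustration; real STR genotyping works from fragment lengths or sequencing reads rather than from a plain string.

```python
import re


def count_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the longest run of consecutive copies of `motif` in `sequence`."""
    # Each match is one uninterrupted block of motif copies.
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Eight TCTA copies, as in the DYS391 example; the flanks "GG"/"CC" are made up.
allele_value = count_tandem_repeats("GG" + "TCTA" * 8 + "CC", "TCTA")
```

Under these assumptions, `allele_value` is the integer that would be recorded for the marker in a Y-STR dataset.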

In modern kinship analyses, the Y-STR is very useful 
for distinguishing lineages and providing information 
about lineage relationships [1]. Many areas of study, in- 
cluding genetic genealogy, forensic genetics, anthropo- 
logical genetics, and medical genetics, have taken 
advantage of the Y-STR method. For example, it has 
been used to trace a similar group of Y-surname projects 
to support traditional genealogical studies, e.g., [2-4]. 
Further, in forensic genetics, the Y-STR is one of the primary
concerns in human identification for sexual assault
cases [5], paternity testing [6], missing persons [7],
human migration patterns [8], and the reexamination of
ancient cases [9].

From a clustering perspective, the goal of partitioning Y- 
STR data is to group a set of Y-STR objects into clusters 
that represent similar genetic distances. The genetic dis- 
tance of two Y-STR objects is based on the mismatch 
results from comparing the Y-STR objects and their modal 
haplotypes. For Y-surname applications, two people who
share 0, 1, 2, or 3 allele value mismatches across the compared
markers are considered to be the most closely related familially.
Furthermore, for Y-haplogroup applications, the number 
of mismatches is variant and greater than that typically 
found in Y-surname applications. This is because the hap- 
logroup application is based on larger family groups 
branched out from the same ancestor, covering certain 
geographical areas and ethnicities throughout the world. 
The established Y-DNA haplogroups named by the letters 
A to T, with further subdivisions using numbers and lower 
case letters, are now available for reference (see [10] and 
[11] for details). 
Efforts to group Y-STR data based on genetic distances
have recently been reported. For example, Schlecht et al. 
[12] used machine learning techniques to classify Y-STR 
fragments into related groups. Furthermore, Seman et al. 
[13-19] used partitional clustering techniques to group 
Y-STR data by the number of repeats, a method used in 
genetic genealogy applications. In this study, we continue 
efforts to partition the Y-STR data based on the parti- 
tional clustering approaches carried out in [13-19]. Re- 
cently, we have also evaluated eight partitional clustering 
algorithms over six Y-STR datasets [19]. As a result, we 
found that there is scope to propose a new partitioning 
algorithm to improve the overall clustering results for 
the same datasets. 

A new partitioning algorithm is required to handle 
the characteristics of Y-STR data, thus producing better 
clustering results. Y-STR data differ somewhat from
the common categorical data used in [20-25].
The Y-STR data contain a higher degree of similarity of 
Y-STR objects in their intra-classes and inter-classes. 
(Note that the degree of similarity is based on the mis- 
match results when comparing the objects and their 
modal haplotypes.) For example, many Y-STR surname 
objects are found to be similar (zero mismatches) and 
almost similar (1, 2, and 3 mismatches) in their intra- 
classes. In some cases, the mismatch values of inter- 
class objects are not obviously far apart. Y-STR hap- 
logroup data contain similar, almost similar, and also 
quite distant objects. Occasionally, the Y-STR hap- 
logroup data may include sub-classes that are sparse in 
their intra-classes. 

Partitional clustering algorithms 

Classically, clustering has been divided into hierarchical 
and partitional methods. The main difference between 
the two is that the hierarchical method breaks the data 
up into hierarchical clusters, whereas the partitional 
method divides the data into mutually disjoint partitions. 
The pillar of the partitional algorithms is the k-Means
algorithm [26], introduced almost four decades ago. Since
then, the k-Means paradigm has been extended
to various versions, including the k-Modes algorithm
[25] for categorical data.

The k-Modes algorithm owes its existence to the ineffectiveness
of the k-Means algorithm for handling categorical
data. Ralambondrainy [27] attempted to rectify
this using a hybrid numeric-symbolic method based on
the binary characters 0 and 1. However, this approach
suffered from an unacceptable computational cost, particularly
when the categorical attributes had many categories.
Since then, a variety of k-Modes-type algorithms
have been introduced, such as k-Modes with new dissimilarity
measures [21,22], k-Population [23], and a new
Fuzzy k-Modes [20].



Partitional algorithms use an objective function in their
optimization process, and the determination of this function
was described as the P problem by Bobrowski and
Bezdek [28] and Salim and Ismail [29]. When he proposed
the k-Modes clustering algorithm, Huang [25] split P into
$P_1$ and $P_2$. $P_1$ denotes the minimization problem of
obtaining values for the partition matrix $w_{li}$ of 0 or 1
(for the hard clustering approach) or 0 to 1 (for the fuzzy
clustering approach); see Eq. (1b) as an example. Furthermore,
$P_2$ denotes the minimization problem of obtaining
the value that occurs most often (or the mode of a categorical
data set) to represent the center of a cluster (often
called the centroid). The minimization of $P_2$ by obtaining
the appropriate mode essentially causes the minimization
of problem $P_1$, and vice versa. As an example of the
optimization process for problem P in the Fuzzy k-Modes
algorithm, we wish to solve Eq. (1) subject to Eqs. (1a),
(1b), and (1c).

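$$P(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha}\, d(X_i, Z_l) \tag{1}$$

subject to:

$$\sum_{l=1}^{k} w_{li} = 1, \quad 1 \le i \le n \tag{1a}$$

$$w_{li} \in [0, 1], \quad 1 \le i \le n,\ 1 \le l \le k \tag{1b}$$

and

$$0 < \sum_{i=1}^{n} w_{li} < n, \quad 1 \le l \le k \tag{1c}$$

where:

• $w_{li}$ is a (k × n) partition matrix that denotes the
degree of membership of object i in the lth cluster,
taking a value from 0 to 1,

• k (≤ n) is a known number of clusters,

• Z is the set of centroids such that $[Z_1, Z_2, \ldots, Z_k] \in R^{mk}$,

• $\alpha \in [1, \infty)$ is a weighting exponent,

• $d(X_i, Z_l)$ is the distance measure between the object
$X_i$ and the centroid $Z_l$, as described in Eqs. (2) and (2a).

$$d(X_i, Z_l) = \sum_{j=1}^{m} \delta(x_{ij}, z_{lj}) \tag{2}$$

where:

$$\delta(x_{ij}, z_{lj}) = \begin{cases} 0, & x_{ij} = z_{lj} \\ 1, & x_{ij} \ne z_{lj} \end{cases} \tag{2a}$$
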
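Huang and Ng [24] described the optimization process
of $P_1$ and $P_2$ as follows:

• Problem $P_1$: Fix $Z = \hat{Z}$ and solve the reduced problem
$P(W, \hat{Z})$ as in Eq. (3). This process obtains the
minimizing values, between 0 and 1, of the partition matrix $w_{li}$.

$$w_{li} = \begin{cases} 1, & \text{if } X_i = Z_l \\ 0, & \text{if } X_i = Z_h,\ h \ne l \\ \dfrac{1}{\sum_{h=1}^{k} \left[ \dfrac{d(X_i, Z_l)}{d(X_i, Z_h)} \right]^{1/(\alpha-1)}}, & \text{if } X_i \ne Z_l \text{ and } X_i \ne Z_h,\ 1 \le h \le k \end{cases} \tag{3}$$

• Problem $P_2$: Fix $W = \hat{W}$ and solve the reduced
problem $P(\hat{W}, Z)$ as in Eq. (4) subject to Eq. (4a).
This process obtains the most frequent attributes, or
the modes, which give the centroids.

$$z_{lj} = a_j^{(r)} \in DOM(A_j) \tag{4}$$

where:

$$\sum_{i,\, x_{ij} = a_j^{(r)}} w_{li}^{\alpha} \ \ge \sum_{i,\, x_{ij} = a_j^{(t)}} w_{li}^{\alpha}, \quad 1 \le t \le n_j,\ 1 \le j \le m \tag{4a}$$

Here $n_j$ is the number of categories of attribute $A_j$,
and $\alpha \in [1, \infty)$ is a weighting exponent.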
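The two alternating updates above can be sketched in Python as follows. This is an illustrative reading of Eqs. (3), (4) and (4a) with the simple matching distance, not the reference implementation of [24]; the function names are hypothetical, `alpha` is assumed to be strictly greater than 1, and when an object coincides with more than one centroid the sketch gives full membership to the first of them.

```python
from collections import defaultdict


def simple_matching(x, z):
    """Number of attributes on which object x and centroid z differ."""
    return sum(a != b for a, b in zip(x, z))


def update_memberships(X, Z, alpha):
    """P1 step (Eq. 3): fix the centroids Z and compute W as k rows of n values."""
    exponent = 1.0 / (alpha - 1.0)
    W = [[0.0] * len(X) for _ in Z]
    for i, x in enumerate(X):
        d = [simple_matching(x, z) for z in Z]
        if 0 in d:
            W[d.index(0)][i] = 1.0  # object coincides with a centroid
            continue
        for l in range(len(Z)):
            W[l][i] = 1.0 / sum((d[l] / d[h]) ** exponent for h in range(len(Z)))
    return W


def update_modes(X, W, alpha):
    """P2 step (Eqs. 4, 4a): fix W and pick, per attribute, the category
    with the largest summed w**alpha in each cluster."""
    modes = []
    for row in W:
        mode = []
        for j in range(len(X[0])):
            weight = defaultdict(float)
            for x, w in zip(X, row):
                weight[x[j]] += w ** alpha
            mode.append(max(weight, key=weight.get))
        modes.append(tuple(mode))
    return modes
```

Alternating `update_memberships` and `update_modes` until the centroids stop changing gives the usual Fuzzy k-Modes loop; the mode step is the one that the following section identifies as problematic for Y-STR data.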



Problem of partitioning Y-STR data 

Due to the characteristics of Y-STR data, there are two
optimization problems for existing partitional algorithms:
non-unique centroids and local minima problems.
These two problems are caused by the drawback
of the modes mechanism of determining the centroids.
Non-unique centroids would result in empty clusters,
whereas the local minima problem leads to poorer
clustering results. Both problems are a result of the
obtained centroids, which are not sufficient to represent
their classes.

Therefore, problems will occur for the following two 
cases: 

i) The total number of objects in a dataset is small
while the number of classes is large. To illustrate this
case, consider the following example.

Example I: Figure 1 shows an artificial example of a
dataset consisting of nine objects in three classes: Class
A = {$A_1$, $A_2$, $A_3$}, Class B = {$B_1$, $B_2$, $B_3$}, and Class C =
{$C_1$, $C_2$, $C_3$}. Each object is composed of three attributes,
represented in lower case; e.g., for object $A_1$, the attributes
are $a_1$, $a_2$, and $a_3$. The dataset is considered to
have a higher degree of similarity between objects in
intra-classes, while the number of objects is small and the
number of classes is large. Thus, the appropriate modes
for representing the classes are: Class A - [$a_1$, $a_2$, $a_3$],
Class B - [$a_1$, $b_2$, $c_3$], and Class C - [$b_1$, $c_2$, $d_3$]. However,
attribute $a_1$ in DOMAIN($A_1$), $a_2$ in DOMAIN($A_2$), and
$c_3$ in DOMAIN($A_3$) are too dominant, and would therefore
dominate the process of updating $P_2$. Figure 2
shows the possibility that each cluster is formed by the
dominant attributes.

As a result, the mode that consists of [$a_1$, $a_2$, $c_3$]
would be obtained twice. Thus, $P_2$ would not be minimized
due to this non-unique centroid. Another possibility
is that the two modes are different, but are not
distinctive enough to represent their clusters, such as
modes [$a_1$, $a_2$, $a_3$] or [$a_1$, $a_2$, $b_3$] for Cluster 2. As a
consequence, this case would fall into a local minima
problem.
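The non-unique centroid effect can be reproduced with a few lines of Python. The cluster contents below are hypothetical toy data, loosely modeled on the situation drawn in Figure 2 and chosen so that no attribute count is tied; `cluster_mode` is an illustrative helper, not part of any cited algorithm.

```python
from collections import Counter


def cluster_mode(objects):
    """Attribute-wise mode of a cluster of equal-length categorical tuples."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


# Hypothetical intermediate partition: classes A, B and C are mixed in clusters 1 and 2.
cluster_1 = [("a1", "a2", "a3"), ("a1", "b2", "c3"), ("b1", "a2", "c3")]
cluster_2 = [("a1", "a2", "a3"), ("a1", "a2", "c3"), ("a1", "b2", "c3")]
cluster_3 = [("b1", "c2", "d3"), ("b1", "c2", "d3"), ("a1", "b2", "c3")]

modes = [cluster_mode(c) for c in (cluster_1, cluster_2, cluster_3)]
distinct_modes = set(modes)
```

In this toy partition the dominant values a1, a2 and c3 win in both of the first two clusters, so three clusters yield only two distinct centroids. In the next assignment step every object is equally close to both copies, which is how one of the clusters can end up empty.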
Figure 1 Artificial Example 1. An example of higher degree of similarity between objects.
Figure 2 The dominant attributes form centroid 1 ($a_1$, $a_2$, $c_3$), centroid 2 ($a_1$, $a_2$, $c_3$) and centroid 3 ($b_1$, $c_2$, $d_3$). In this case, there are
possibilities that each cluster is formed by the dominant attributes, e.g. attributes $a_1$, $a_2$ and $c_3$. This scenario of non-unique centroids would result
in empty clusters; otherwise the centroids would lead to a local minima problem and produce poorer clustering results.



ii) An extreme distribution of objects in a class. To
illustrate this case, consider the following example.

Example II: Figure 3 shows a dataset consisting of eight
objects in two classes: Class A = {$A_1$, $A_2$, $A_3$, $A_4$, $A_5$, $A_6$}
and Class B = {$B_1$, $B_2$}. Each object consists of three attributes,
again represented in lower case. The appropriate
modes to represent the classes are: Class A 
and Class B - \a h c 3 ] or [a lt d 3 ] . The distribution of 
objects in Class A is considerably larger than in Class B,
covering approximately 75% of the total set of objects.
This characteristic of the data is found to be problematic
for $P_2$, particularly for the fuzzy approach. The problem is
actually caused by the initial centroid selection. Figure 4
shows that the objects in Class A would be equally distributed
into clusters 1 and 2.

As a result, the objects of Class A become dominant in both clusters,
and so the obtained modes might be represented solely by
objects in Class A, e.g., [$a_1$, $a_2$, $a_3$] and [$a_1$, $a_2$, $b_3$].
Figure 3 Artificial Example 2. An example of the extreme distribution of objects in a class.
Figure 4 The extreme distribution of objects A forms centroid 1 ($a_1$, $a_2$, $a_3$) and centroid 2 ($a_1$, $a_2$, $b_3$). In this case, the objects in Class A
are equally distributed into clusters 1 and 2. Therefore, the obtained centroids are not sufficient to represent their classes.



The above situations cause P not to be fully optimized,
thus producing poor clustering results. Therefore, a new
algorithm with a new concept of $P_2$ is proposed in order
to overcome these problems and improve the clustering
accuracy results of Y-STR data.

Methods 

The center of a cluster 

The mode mechanism for the center of a cluster (problem
$P_2$) is not appropriate for handling the characteristics of
Y-STR data, and therefore cannot be used as a mechanism
to represent the center of a cluster (centroid). Instead,
the center of Y-STR data should be the modal haplotype,
which is required to calculate the distance of Y-STR
objects. The distance between a Y-STR object and its
modal haplotype can be formalized as in Eq. (5) subject to
Eq. (5a).

$$d_{ystr}(X_i, H_l) = \sum_{j=1}^{m} \gamma(x_{ij}, h_{lj}) \tag{5}$$

subject to:

$$\gamma(x_{ij}, h_{lj}) = \begin{cases} 0, & x_{ij} = h_{lj} \\ 1, & x_{ij} \ne h_{lj} \end{cases} \tag{5a}$$

where m is the number of markers.
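A direct Python reading of Eqs. (5) and (5a) is a mismatch count over the markers. The six allele values below are hypothetical and serve only to show the call; real haplotypes in this study have 25 markers.

```python
def d_ystr(x, h):
    """Eq. (5): number of markers on which object x and modal haplotype h differ."""
    if len(x) != len(h):
        raise ValueError("object and modal haplotype must cover the same markers")
    return sum(1 for xj, hj in zip(x, h) if xj != hj)


# Hypothetical allele values for six markers.
modal_haplotype = (13, 24, 14, 11, 11, 14)
person = (13, 24, 14, 10, 11, 15)
mismatches = d_ystr(person, modal_haplotype)
```

Here `mismatches` is 2, which in a Y-surname setting would fall within the 0 to 3 mismatch range described earlier for closely related individuals. Note that the measure counts how many markers differ, not by how many repeats they differ.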

The modal haplotype is controlled by groups of
objects that are similar or almost similar in Y-STR data.
The similar and almost similar objects have a lower distance,
or a higher degree of membership values in a
fuzzy sense. Thus, these two groups are considered the
most dominant objects required to find the Approximate
Modal Haplotype. Consider four objects $x_1$, $x_2$, $x_3$, and
$x_4$ and two clusters $c_1$ and $c_2$. The membership value for
each object and its cluster are as shown in Table 1,
whereby objects $x_1$ and $x_3$ have a 100% chance of being
the most dominant object in cluster $c_1$, but only a 50%
chance of being the dominant object in cluster $c_2$, and
so on. A dominant weighting value of 1.0 is given to any
dominant object and a weight of 0.5 is given to the
remaining objects.

Table 1 Example of dominant objects

The k-AMH algorithm

Let $X = \{X_1, X_2, \ldots, X_n\}$ be a set of n Y-STR objects and
$A = \{A_1, A_2, \ldots, A_m\}$ be a set of markers (attributes) of a
Y-STR object. Let $H = \{H_1, H_2, \ldots, H_k\} \in X$ be the set of
Approximate Modal Haplotypes for k clusters. Suppose
k is known a priori. Let $H_l$ be the Approximate
Modal Haplotype, represented as $[h_{l,1}, h_{l,2}, \ldots, h_{l,m}]$,
and therefore $h_{l,j} = x_{i,j}$ for $1 \le j \le m$ and $1 \le i \le n$.
The objective of the algorithm is to partition the categorical
objects X into k clusters. Thus, $H_l$ can be
replaced by $X_t$, for t up to n, provided the condition
described in Eq. (6) is satisfied.

$$P(A_t) > P(A_s), \quad s \ne t;\ \forall t,\ 1 \le t \le (n-k) \tag{6}$$


Here, P(A) is the cost function described in Eq. (7), which
is subject to Eqs. (7a), (8), (8a), (8b), (9), (9a), (9b), and (9c).

$$P(A) = \sum_{l=1}^{k} \sum_{i=1}^{n} A_{li} \tag{7}$$

subject to:

$$A_{li} = W_{li}^{\alpha} D_{li} \tag{7a}$$

• $W_{li}^{\alpha}$ is a (k × n) partition matrix that denotes the
degree of membership of Y-STR object i in the lth
cluster, taking a value from 0 to 1 as described in
Eq. (8), subject to Eqs. (8a) and (8b).

$$W_{li}^{\alpha} = \begin{cases} 1, & \text{if } X_i = H_l \\ 0, & \text{if } X_i = H_z,\ z \ne l \\ \dfrac{1}{\sum_{z=1}^{k} \left[ \dfrac{d_{ystr}(X_i, H_l)}{d_{ystr}(X_i, H_z)} \right]^{1/(\alpha-1)}}, & \text{if } H_l \ne X_i \text{ and } X_i \ne H_z,\ 1 \le z \le k \end{cases} \tag{8}$$

subject to:

$$W_{li}^{\alpha} \in [0, 1], \quad 1 \le i \le n,\ 1 \le l \le k \tag{8a}$$

and

$$0 < \sum_{i=1}^{n} W_{li}^{\alpha} < n, \quad 1 \le l \le k \tag{8b}$$

where,

• k (≤ n) is a known number of clusters.

• H is the Approximate Modal Haplotype (centroid)
such that $[H_1, H_2, \ldots, H_k] \in X$.

• $\alpha \in [1, \infty)$ is a weighting exponent, used to
increase the precision of the membership degrees.
Note that $\alpha$ is typically set between 1.1 and 2.0,
as introduced by Huang and Ng [24].

• $d_{ystr}(X_i, H_l)$ is the distance measure between the Y-STR
object $X_i$ and the Approximate Modal Haplotype $H_l$,
as described in Eq. (5) subject to Eq. (5a).

• $D_{li}$ is another (k × n) partition matrix which
contains a dominant weighting value of 1.0 or 0.5, as
explained above (see Table 1). The dominant
weighting values are based on the value of $W_{li}^{\alpha}$
above. $D_{li}$ is described in Eq. (9), subject to
Eqs. (9a), (9b), and (9c).

$$D_{li} = \begin{cases} 1.0, & \text{if } W_{li}^{\alpha} = \max_{1 \le l \le k} W_{li}^{\alpha} \\ 0.5, & \text{otherwise} \end{cases} \tag{9}$$

subject to:

$$D_{li} \in \{1, 0.5\}, \quad 1 \le i \le n,\ 1 \le l \le k \tag{9a}$$

$$1.5 \le \sum_{l=1}^{k} D_{li} \le k, \quad 1 \le i \le n \tag{9b}$$

$$1.5 < \sum_{i=1}^{n} D_{li} \le n, \quad 1 \le l \le k \tag{9c}$$
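To make Eqs. (7), (7a) and (9) concrete, the sketch below derives the dominant weights from a given membership matrix and sums the weighted memberships. The membership values are hypothetical (the entries of Table 1 are not reproduced here), the function names are illustrative, and the input matrix is taken to hold the $W_{li}^{\alpha}$ values of Eq. (8) directly.

```python
def dominant_weights(W):
    """Eq. (9): 1.0 where an object's membership is the largest over the clusters,
    0.5 elsewhere. W is a list of k rows, each with n membership values."""
    k, n = len(W), len(W[0])
    D = [[0.5] * n for _ in range(k)]
    for i in range(n):
        column_max = max(W[l][i] for l in range(k))
        for l in range(k):
            if W[l][i] == column_max:
                D[l][i] = 1.0
    return D


def cost(W, D):
    """Eqs. (7) and (7a): P(A) is the sum of W[l][i] * D[l][i] over all l and i."""
    return sum(w * d for row_w, row_d in zip(W, D) for w, d in zip(row_w, row_d))


# Hypothetical memberships of four objects (columns) in two clusters (rows).
W = [[0.9, 0.2, 0.6, 0.5],
     [0.1, 0.8, 0.4, 0.5]]
D = dominant_weights(W)
p = cost(W, D)
```

The fourth object is equally close to both clusters, so under a literal reading of Eq. (9) it receives a dominant weight of 1.0 in each, which is consistent with the upper bound k in Eq. (9b).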
The basic idea of the k-AMH algorithm is to find k
clusters in n objects by first randomly selecting an object
to be the Approximate Modal Haplotype h for
each cluster. The next step is to iteratively test the
objects x one-by-one as replacements for the Approximate Modal
Haplotype h. A replacement is made according to Eq. (6) if
the cost function described in Eq. (7), subject to
Eqs. (7a), (8), (8a), (8b), (9), (9a), (9b), and (9c), is maximized.
Thus, the differences between the k-AMH algorithm
and the other k-Mode-type algorithms are as
follows.

i. The objects (the data themselves) are used as the 
centroids instead of modes. Since the distance of 
Y-STR objects is measured by comparing the 
objects and their modal haplotypes, we need to 
approximately find the objects that can represent 
the modal haplotypes. In finding the final 
Approximate Modal Haplotype for a particular 
group (cluster), each object needs to be tested one- 
by-one and replaced on a maximization of a cost 
function. 

ii. A maximization process of the cost function is
required instead of minimizing it as in the k-Mode-type
algorithms.

A detailed description of the k-AMH algorithm is
given below.

Step 1 - Select k initial objects randomly as
Approximate Modal Haplotypes (centroids).
E.g., if k = 4, then choose 4 objects randomly
as the initial Approximate Modal Haplotypes.

Step 2 - Calculate the distance $d_{ystr}(X_i, H_l)$ according to
Eq. (5), subject to Eq. (5a).

Step 3 - Calculate the partition matrix $W_{li}^{\alpha}$ according to
Eq. (8), subject to Eqs. (8a) and (8b). Note
that $W_{li}^{\alpha}$ is based on the distance
calculated in Step 2.

Step 4 - Assign a dominant weighting of 1.0 or 0.5 in
partition matrix $D_{li}$ according to Eqs. (9), (9a),
(9b), and (9c).

Step 5 - Calculate the cost function P(A) based on $W_{li}^{\alpha} D_{li}$
according to Eqs. (7) and (7a).

Step 6 - Test each initial modal haplotype against the
other objects one-by-one. If the current cost
function is greater than the previous cost function
according to Eq. (6), then replace it.

Step 7 - Repeat Steps 2 to 6 for each x and h.

Step 8 - Once the final Approximate Modal
Haplotypes are obtained for all clusters, assign
the objects to their corresponding crisp clusters
$C_{li}$ according to Eq. (10).

$$C_{li} = \begin{cases} 1, & \text{if } l = \arg\max_{1 \le j \le k} W_{ji}^{\alpha} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

Furthermore, the implementation of the steps above of 
the algorithm is formalized in the form of pseudo-code 
as follows. 

INPUT: dataset X, the number of clusters k, the number
of dimensions d, and the fuzziness index $\alpha$
OUTPUT: a set of k clusters

Select H_l randomly from X such that 1 ≤ l ≤ k
for each Approximate Modal Haplotype H_l do
    for each X_i do
        Calculate P(A) = Σ_{l=1..k} Σ_{i=1..n} A_li
        if P(A) is maximized then
            Replace H_l by X_i
        end if
    end for
end for
Assign X_i to C_l for all l, 1 ≤ l ≤ k; 1 ≤ i ≤ n, as in Eq. (10)
Output results
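The steps and pseudo-code above can be turned into a small runnable Python sketch. It is one possible reading under stated assumptions, not the authors' implementation: the function names and the `init` and `seed` parameters are hypothetical; the value of Eq. (8) is used directly as the $W_{li}^{\alpha}$ term of Eq. (7a); `alpha` must be greater than 1; and, departing from a literal reading of Eq. (9), ties for the dominant cluster are broken toward the lowest cluster index so that duplicate centroids are not rewarded.

```python
import random


def d_ystr(x, h):
    """Eq. (5): number of markers whose allele values differ."""
    return sum(1 for xj, hj in zip(x, h) if xj != hj)


def memberships(X, H, alpha):
    """Eq. (8): W[l][i], membership of object i in cluster l."""
    k, exponent = len(H), 1.0 / (alpha - 1.0)
    W = [[0.0] * len(X) for _ in range(k)]
    for i, x in enumerate(X):
        d = [d_ystr(x, h) for h in H]
        if 0 in d:
            # Object coincides with a centroid; if several centroids are
            # identical, only the first gets membership 1 (sketch assumption).
            W[d.index(0)][i] = 1.0
        else:
            for l in range(k):
                W[l][i] = 1.0 / sum((d[l] / d[z]) ** exponent for z in range(k))
    return W


def cost(X, H, alpha):
    """Eqs. (7), (7a), (9): sum of membership times dominant weight."""
    W = memberships(X, H, alpha)
    total = 0.0
    for i in range(len(X)):
        column = [W[l][i] for l in range(len(H))]
        dominant = column.index(max(column))  # ties -> lowest index (assumption)
        total += sum(w * (1.0 if l == dominant else 0.5) for l, w in enumerate(column))
    return total


def k_amh(X, k, alpha=1.5, init=None, seed=0):
    """Return (centroid indices, crisp labels) for a list of haplotype tuples."""
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(range(len(X)), k)
    best = cost(X, [X[c] for c in centers], alpha)
    for l in range(k):                       # each Approximate Modal Haplotype
        for i in range(len(X)):              # each candidate object
            if i in centers:
                continue
            trial = centers[:l] + [i] + centers[l + 1:]
            p = cost(X, [X[c] for c in trial], alpha)
            if p > best:                     # Eq. (6): keep only improvements
                best, centers = p, trial
    W = memberships(X, [X[c] for c in centers], alpha)
    labels = [max(range(k), key=lambda l: W[l][i]) for i in range(len(X))]  # Eq. (10)
    return centers, labels


# Hypothetical four-marker haplotypes: two families of three objects each.
X = [(13, 24, 14, 11), (13, 24, 14, 11), (13, 24, 14, 10),
     (15, 21, 16, 12), (15, 21, 16, 12), (15, 21, 17, 12)]
centers, labels = k_amh(X, k=2, init=[0, 1])
```

The usage lines deliberately start from two identical initial centroids taken from the same family. In this toy case the replacement loop moves one centroid to the other family, and the two families end up in separate clusters.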

Optimization of the problem P 

In optimizing the problem P, the k-AMH algorithm uses
a maximization process instead of the minimization
process imposed by the k-Mode-type algorithms. This
process is formalized in the k-AMH algorithm as follows.

Step 1 - Choose an Approximate Modal Haplotype,
$H^{(t)} \in X$. Calculate P(A); set t = 1.
Step 2 - Choose $X^{(t+1)}$ such that $P(A)^{(t+1)}$ is maximized;
replace $H^{(t)}$ by $X^{(t+1)}$.
Step 3 - Set t = t + 1; stop when t = n; otherwise go to Step 2.
*Note: n is the number of objects.

The convergence of the algorithm is proven as $P_1$ and
$P_2$ are maximized accordingly. The function P(A) incorporates
the P(W, H) function imposed by the Fuzzy k-Modes
algorithm, where W is a partition matrix and H
is the Approximate Modal Haplotype that defines the center
of a cluster. Thus, $P_1$ and $P_2$ are solved by Theorems
1 and 2, respectively.

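Theorem 1 - Let $\hat{H}$ be fixed. $P(W, \hat{H})$ is maximized if
and only if

$$W_{li}^{\alpha} = \begin{cases} 1, & \text{if } X_i = H_l \\ 0, & \text{if } X_i = H_z,\ z \ne l \\ \dfrac{1}{\sum_{z=1}^{k} \left[ \dfrac{d_{ystr}(X_i, H_l)}{d_{ystr}(X_i, H_z)} \right]^{1/(\alpha-1)}}, & \text{if } H_l \ne X_i \text{ and } X_i \ne H_z,\ 1 \le z \le k \end{cases}$$
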

Proof

Let $X = \{X_1, X_2, \ldots, X_n\}$ be a set of n Y-STR categorical
objects and $H = \{H_1, H_2, \ldots, H_k\}$ be a set of centroids
(Approximate Modal Haplotypes) for k clusters. Suppose
that $P = \{P_1, P_2, \ldots, P_k\}$ is a set of dissimilarity measures
based on $d_{ystr}(X_i, H_l)$, as described in Eq. (5) subject
to Eq. (5a), $\forall$ i and l, $1 \le i \le n$; $1 \le l \le k$.

Definition 1 - For $X_i = H_l$ and $X_i = H_z$, where $z \ne l$,
the membership value for all i is

$$W_{li}^{\alpha} = \begin{cases} 1, & \text{if } X_i = H_l \\ 0, & \text{if } X_i = H_z,\ z \ne l \end{cases}$$

For any P that is obtained from $d_{ystr}(X_i, H_l)$ where
$X_i = H_l$, the maximum value of $W_{li}^{\alpha}$ is 1, and for $X_i = H_z$,
$z \ne l$, the value of $W_{li}^{\alpha}$ is 0. Therefore, because $H_l$ is
fixed, $W_{li}^{\alpha}$ is maximized.

Definition 2 - For the case of $H_l \ne X_i$ and $X_i \ne H_z$,
$\forall z$, $1 \le z \le k$, the membership value for all i is

$$W_{li}^{\alpha} = \frac{1}{\sum_{z=1}^{k} \left[ \dfrac{d_{ystr}(X_i, H_l)}{d_{ystr}(X_i, H_z)} \right]^{1/(\alpha-1)}}$$

Suppose that $p_{li} \in P$ is the minimum value; we write

$$\left[ \frac{p_{li}}{p_{zi}} \right]^{1/(\alpha-1)} < \left[ \frac{p_{ti}}{p_{zi}} \right]^{1/(\alpha-1)}$$

where $1 \le l \le k$; $1 \le z \le k$. Thus,

$$\sum_{z=1}^{k} \left[ \frac{p_{li}}{p_{zi}} \right]^{1/(\alpha-1)} < \sum_{z=1}^{k} \left[ \frac{p_{ti}}{p_{zi}} \right]^{1/(\alpha-1)}$$

where $t \ne l$ and $\forall$ z and t, $1 \le z \le k$; $1 \le t \le k$. It follows
that

$$W_{li}^{\alpha} = \left( \sum_{z=1}^{k} \left[ \frac{p_{li}}{p_{zi}} \right]^{1/(\alpha-1)} \right)^{-1} > \left( \sum_{z=1}^{k} \left[ \frac{p_{ti}}{p_{zi}} \right]^{1/(\alpha-1)} \right)^{-1}$$

where $1 \le i \le n$.

Therefore, based on Definitions 1 and 2, $W_{li}^{\alpha}$ is maximal.
Because $\hat{H}$ is fixed, $P(W, \hat{H})$ is maximized.

Theorem 2 - Let $h_l \in X$ be the initial center of a cluster
for $1 \le l \le k$. $h_l$ is replaced by the Approximate
Modal Haplotype if and only if

$$P(A_t) > P(A_s), \quad s \ne t;\ \forall t,\ 1 \le t \le (n-k).$$

Proof

Let $D = \{D_1, D_2, \ldots, D_k\}$ be a set of dominant weighting
values. For any maximum value of $W_{li}^{\alpha}$ as proved by Theorem 1,
we assign an optimum value of 1.0 as a dominant
weighting value, otherwise 0.5, as described in Eq. (9)
subject to Eqs. (9a), (9b), and (9c). We write

$$A_{li} = W_{li}^{\alpha} D_{li}$$

Because $W_{li}^{\alpha}$ and $D_{li}$ are non-negative, the product
$W_{li}^{\alpha} D_{li}$ must be maximal. It follows that the sum of all
quantities $\sum_{l=1}^{k} \sum_{i=1}^{n} A_{li}$ is also maximal. Hence, the
result follows.

Y-STR Datasets 

The Y-STR data were mostly obtained from a database
called worldfamilies.net [30]. The first, second, and third
datasets represent Y-STR data for haplogroup applications,
whereas the fourth, fifth, and sixth datasets represent
Y-STR data for Y-surname applications. All datasets
were filtered for standardization on the same 25 attributes
(25 markers). The chosen markers include DYS393,
DYS390, DYS19 (394), DYS391, DYS385a, DYS385b,
DYS426, DYS388, DYS439, DYS389I, DYS392, DYS389II,
DYS458, DYS459a, DYS459b, DYS455, DYS454, DYS447,
DYS437, DYS448, DYS449, DYS464a, DYS464b, DYS464c,
and DYS464d. These markers are more than sufficient for
determining a genetic connection between two people.
According to Fitzpatrick [31], 12 markers (Y-DNA12 test)
are already sufficient to determine who does or does not
have a relationship to the core group of a family.

All datasets were retrieved from the respective web- 
sites in April 2010, and can be described as follows: 

1) The first dataset consists of 751 objects of the Y-STR
haplogroup belonging to the Ireland yDNA project
[32]. The data contain only 5 haplogroups, namely E
(24), G (20), I (200), J (32), and R (475). Thus, k = 5.

2) The second dataset consists of 267 objects of the Y-STR
haplogroup obtained from the Finland DNA
Project [33]. The data are composed of only 4
haplogroups: I (92), J (6), N (141), and R (28). Thus,
k = 4.

3) The third dataset consists of 263 objects obtained 
from the Y-haplogroup project [34]. The data 
contain Groups G (37), N (68), and T (158). Thus, 
k = 3. 

4) The fourth dataset consists of 236 objects combining 
four surnames: Donald [35], Flannery [36], Mumma 
[37], and William [38]. Thus, k = 4. 



5) The fifth dataset consists of 112 objects belonging to 
the Philips DNA Project [39]. The data consist of 
eight family groups: Group 2 (30), Group 4 (8), Group 
5 (10), Group 8 (18), Group 10 (17), Group 16 (10), 
Group 17 (12), and Group 29 (7). Thus, k = 8. 

6) The sixth dataset consists of 112 objects belonging 
to the Brown Surname Project [40]. The data consist 
of 14 family groups: Group 2 (9), Group 10 (17), 
Group 15 (6), Group 18 (6), Group 20 (7), Group 23 
(8), Group 26 (8), Group 28 (8), Group 34 (7), 
Group 44 (6), Group 35 (7), Group 46 (7), Group 49 
(10), and Group 91 (6). Thus, k = 14. 

The values in parentheses indicate the number of 
objects belonging to that particular group. Datasets 1-3 
represent Y-STR haplogroups and datasets 4-6 represent 
Y-STR surnames. 

Results and discussion 

The following results compare the performance of the k-AMH
algorithm with eight other partitional algorithms:
the k-Modes algorithm [25], k-Modes with RVF [21-22,41],
k-Modes with UAVM [21], k-Modes with Hybrid 1 [21],
k-Modes with Hybrid 2 [21], the Fuzzy k-Modes algorithm
[24], the k-Population algorithm [23], and the New
Fuzzy k-Modes algorithm [20].

Our analysis was based on the average accuracy scores
obtained from 100 runs for each algorithm and dataset.
During the experiments, the objects in the datasets were
randomly reordered from the preceding run. The misclassification
matrix proposed by Huang [25] was used
to obtain the clustering accuracy scores for evaluating
the performance of each algorithm. The clustering accuracy
r defined by Huang [25] is given by Eq. (11):

$$r = \frac{\sum_{i=1}^{k} a_i}{n} \tag{11}$$

where k is the number of clusters, $a_i$ is the number
of instances occurring in both cluster i and its
corresponding haplogroup or surname, and n is
the number of instances in the dataset.

Table 2 Clustering accuracy scores for all datasets

Algorithm            Dataset 1  Dataset 2  Dataset 3  Dataset 4  Dataset 5  Dataset 6
k-Modes                 0.70       0.79       0.84       0.84       0.74       0.62
k-Modes-RVF             0.79       0.83       0.87       0.78       0.87       0.72
k-Modes-UAVM            0.65       0.75       0.83       0.87       0.56       0.54
k-Modes-Hybrid 1        0.67       0.81       0.85       0.77       0.80       0.64
k-Modes-Hybrid 2        0.56       0.82       0.83       0.79       0.81       0.70
Fuzzy k-Modes           0.56       0.74       0.74       0.97       0.76       0.66
k-Population            0.80       0.90       0.97       1.00       0.97       0.84
New Fuzzy k-Modes       0.71       0.84       0.77       1.00       0.77       0.69
k-AMH                   0.83       0.93       0.96       1.00       1.00       0.87

Table 3 Clustering accuracy scores for all Y-STR datasets

                                            95% Confidence Interval for Mean
Algorithm            N     Mean  Std. Dev.  Lower Bound  Upper Bound   Min    Max
k-Mode               600   0.76    0.13        0.75         0.77       0.45   1.00
k-Mode-RVF           600   0.81    0.11        0.80         0.82       0.56   1.00
k-Mode-UAVM          600   0.70    0.17        0.69         0.71       0.38   1.00
k-Mode-Hybrid 1      600   0.76    0.13        0.75         0.77       0.38   1.00
k-Mode-Hybrid 2      600   0.75    0.14        0.74         0.76       0.45   1.00
Fuzzy k-Mode         600   0.74    0.16        0.73         0.75       0.32   1.00
k-Population         600   0.91    0.09        0.91         0.92       0.59   1.00
New Fuzzy k-Mode     600   0.80    0.13        0.79         0.81       0.44   1.00
k-AMH                600   0.93    0.07        0.93         0.94       0.79   1.00
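A short Python sketch of the accuracy score in Eq. (11) follows. It assumes that the class "corresponding" to a cluster is the class that occurs most often in it; this is a simplification made for illustration and may differ from how the misclassification matrix of [25] pairs clusters with classes. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_accuracy(true_labels, cluster_labels):
    """Eq. (11): r = (sum of a_i over clusters) / n, with a_i taken as the
    size of the majority class inside cluster i (assumption of this sketch)."""
    if not true_labels or len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    members = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster][truth] += 1
    matched = sum(counts.most_common(1)[0][1] for counts in members.values())
    return matched / len(true_labels)


# Toy example: one object of surname group "A" ends up in the wrong cluster.
r = clustering_accuracy(["A", "A", "A", "B", "B", "B"], [0, 0, 1, 1, 1, 1])
```

With five of the six objects in the cluster dominated by their own group, `r` is 5/6. A score of 1.00, as reported for datasets 4 and 5, means that every cluster contains objects from a single group only.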

Clustering performance 

Table 2 shows the clustering accuracy scores for all datasets
(boldface indicates the highest clustering accuracy).
Based on these results, the performance of the k-AMH
algorithm was very promising. Out of six datasets, our
algorithm obtained the highest clustering accuracy
scores for datasets 1, 2, 4, 5, and 6. In fact, the algorithm
also achieved the optimal clustering accuracy for two
datasets (4 and 5). However, for dataset 3, the results
show that the accuracy of the k-AMH algorithm was
0.01 lower than that of the k-Population algorithm. A
statistical t-test was carried out for further verification.
This indicated that t(101.39) = 0.65, and p = 0.51. Thus,
there was no significant difference at the 5% level between
the accuracy score of our k-AMH algorithm and
the k-Population algorithm. This means that both algorithms
displayed an equal performance for this dataset.

During the experiments, the k-AMH algorithm did not
encounter any difficulties. However, the Fuzzy k-Modes
and the New Fuzzy k-Modes algorithms faced problems
with datasets 1, 5, and 6. For dataset 1, the problem was
caused by the extreme number of objects in Class R
(475), which covered about 63% of the total objects. Further,
for datasets 5 and 6, the problem was caused by
many similar objects in a larger number of classes. In
particular, both algorithms faced the problem $P_2$ caused
by the initial centroid selections. Note also that the
results for both algorithms were based on the diverse
method, an initial centroid selection proposed by Huang
[25].

For an overall comparison, Table 3 shows the results of
all Y-STR datasets. It clearly indicates that the k-AMH
algorithm obtained the highest accuracy score of 0.93. The
closest score of 0.91 belongs to the k-Population algorithm.
Furthermore, the k-AMH algorithm also recorded
the best results in terms of standard deviation (0.07), the
lower bound (0.93), the upper bound (0.94), and the minimum
accuracy score (0.79).

For further verification, a one-way ANOVA test was
carried out. This indicated that the assumption of
homogeneity of variance was violated; therefore, the
Welch F-ratio is reported. There was a significant variance
in the clustering accuracy scores among the
nine algorithms, in which F(8, 2230) = 378, p < 0.001,
and $\omega^2$ = 0.25. Thus, the Games-Howell procedure was
used for a multiple comparison among the nine algorithms.
Table 4 shows the result of this comparison with
regard to the k-AMH algorithm against the other eight
algorithms. At the 5% level of significance, it is clearly
shown that the k-AMH algorithm (M = 0.93, 95% CI
[0.93, 0.94]) differed from the other eight algorithms
(all p values < 0.01; see Table 4). Thus, the performance of the k-AMH
algorithm exhibited a very significant difference compared
to the other algorithms.

Table 4 Multiple comparisons for the k-AMH algorithm

Accuracy, Games-Howell
                                                                     95% Confidence Interval
(I) Algorithm  (J) Algorithm        Mean Diff. (I-J)  Std. Error  p-value     Lower Bound  Upper Bound
k-AMH          k-Mode                    0.17*           0.01     < 0.00001      0.16         0.19
               k-Mode-RVF                0.12*           0.01     < 0.00001      0.11         0.14
               k-Mode-UAVM               0.23*           0.01     < 0.00001      0.21         0.25
               k-Mode-Hybrid 1           0.17*           0.01     < 0.00001      0.16         0.19
               k-Mode-Hybrid 2           0.18*           0.01     < 0.00001      0.16         0.20
               Fuzzy k-Mode              0.19*           0.01     < 0.00001      0.17         0.21
               k-Population              0.02*           0.00     0.00271        0.01         0.03
               New Fuzzy k-Modes         0.13*           0.01     < 0.00001      0.12         0.15

*Note: p < 0.05.



Efficiency 

We now consider the time efficiency of the k-AMH algorithm.
The computational cost of the algorithm depends
on the nested loop for k(n-k), where k is the number of
clusters and n is the number of data, required to obtain the
cost function P(A). The function P(A) involves the number
of attributes m in calculating the distances and the
membership values for its partition matrix $W_{li}^{\alpha}$. Thus, the
overall time complexity is O(km(n-k)). However, the time
efficiency of the k-AMH algorithm will not reach $O(n^2)$
because the value of k in the outer loop will not become
equivalent to the value of n-k in the inner loop. See the
pseudo-code for a detailed implementation of these loops.

Figure 5 Scalability testing. (a) Execution time to cluster 65,000 data into different numbers of clusters. (b) Execution time to cluster a different
number of data into three clusters.

A scalability test was also carried out for the k-AMH
algorithm. These experiments were based on a dataset
called Connect [42]. This dataset consisted of 65,000
data, 42 attributes, and three classes. Two scalability
tests were conducted: (a) scalability against the number
of clusters, when the number of objects was 65,000, and
(b) scalability against the number of objects, when the
number of clusters was three. The test was performed
on a personal computer with an Intel® Core™ 2 Duo
processor at 2.93 GHz and 2.00 GB of memory. Figure 5
(a) and (b) illustrate the results of the tests. In conclusion,
the runtime of the k-AMH algorithm increased
linearly with the number of clusters and data.

Conclusions 

Our experimental results indicate that the performance of
the proposed k-AMH algorithm for partitioning Y-STR
data was significantly better than that of the other algorithms.
Our algorithm handled all problems, as described
previously, and was not too sensitive to $P_0$, the initial centroid
selection, even though the datasets contained a lot of
similar objects. Moreover, the concept of $P_2$ in using the
object (the data itself) as the approximate center of a cluster
has significantly improved the overall performance of
the algorithm. In fact, our algorithm is the most consistent
of those tested because the difference between the minimum
and maximum scores is smaller. The k-AMH algorithm
always produces the highest minimum score for
each dataset. In conclusion, the k-AMH algorithm is an efficient
method of partitioning Y-STR categorical data.